library(tidyverse)
library(rethinking)
library(dagitty)
library(knitr)
library(ggdag)
library(ggrepel)

INTRO TO EPIDEMIOLOGY

STUDY DESIGN

Types of Studies

  • Experimental
    • Clinical Trials
    • Intervention Trials
    • Prevention Trials
    • Field Trials
  • Observational
    • Cross-sectional studies
    • Cohort Studies (retrospective and prospective)
    • Case-control Studies (including “nested”)
    • Matched Case-control studies
    • Ecological studies

DIAGNOSTIC TEST EVALUATION AND SCREENING TESTS

  • Sensitivity and specificity
  • Predictive value positive and predictive value negative
  • Likelihood ratios (binary, ordinal and quantitative tests)
  • Comparison of sensitivity and specificity of 2 tests
  • Prevalence/apparent prevalence relationship
  • Sensitivity, specificity and predictive values of tests in series and parallel
  • Kappa for interobserver agreement
  • ROC curves

FOR PROJECT: sensitivity/specificity of the DENV PCR test and of the surveillance method. Validation of the PCR test. How do you define dengue exposure? What’s the PRNT cut-off?

MEASURES OF DISEASE FREQUENCY

MORBIDITY

prevalence

Proportion of the population affected at time t = snapshot of disease.

Units: 0-1 or 0-100%

\[\frac{cases\:at\:time\:t\:(new + existing)}{total\:population\:at\:time\:t}\] Risk of being a case.

\[prevalence = incidence * duration\] It’s a bad measure of risk because it depends on the duration of disease. Chronic diseases will have high prevalence, and very fatal diseases will have low prevalence, regardless of the incidence.

  • point prevalence: prevalence at a given timepoint t

\[\frac{cases\:at\:time\:t\:(new + existing)}{total\:population\:at\:time\:t}\]

  • period prevalence: prevalence at any given timepoint during a time period t

\[\frac{cases\:observed\:over\:period\:t\:(new + existing)}{total\:population\:at\:midpoint\:of\:period\:t}\]
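
A quick numeric sketch of the formulas above, including the prevalence = incidence × duration approximation (all numbers are hypothetical):

```r
# Hypothetical numbers: point prevalence and the
# prevalence = incidence * duration approximation.
cases_at_t <- 50          # new + existing cases at time t
population_at_t <- 1000   # total population at time t
point_prevalence <- cases_at_t / population_at_t  # 0.05

incidence <- 0.01   # 1 new case per 100 person-years
duration  <- 5      # average duration of disease, in years
approx_prevalence <- incidence * duration         # also 0.05
```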

incidence

Proportion of the population at risk of being affected that does become affected during a time period t.

\[\frac{new\:cases\:observed\:over\:period\:t}{total\:population\:at\:risk\:during\:period\:t}\]

Risk of becoming a case.

\[\frac{cases}{population*time\:at\:risk}\] It measures risk because it measures events, or transitions from the unaffected to the affected state.

  • cumulative incidence

The population at risk is a crude measure of the population at risk at the beginning of the time period. It assumes a static population at risk.

Units: 0-1 or 0-100% per time interval.

\[\frac{new\:cases\:observed\:over\:period\:t}{total\:population\:at\:risk\:at\:start\:of\:time\:period\:t}\] Measures average risk. Is apt for short time-periods or static populations.

  • incidence density rate

The population at risk is the sum of all the disease-free/ at risk time periods for each individual. It assumes the risk of each person in the population does not change over time.

Units: 0- \(\infty\) cases/population-time

\[\frac{new\:cases\:observed\:over\:period\:t}{total\:population-time}\] Measures risk by taking into account the time elapsed before disease occurred for each individual; thus it also measures the speed at which disease occurs at a given timepoint. Apt for prolonged time periods or dynamic populations.

To calculate population-time:

  • sum all the disease-free time for each individual, or
  • estimate it:
    • multiply the population size at the midpoint of the period by the length of the period
    • multiply the average of the population sizes at the beginning and the end of the period by the length of the period.
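
The exact (individual-level) calculation can be sketched with hypothetical follow-up data:

```r
# Hypothetical follow-up of 6 individuals: disease-free time each person
# contributed before becoming a case or leaving the study.
time_at_risk <- c(3, 1.5, 4, 4, 2.5, 4)      # years at risk per person
new_cases   <- 2                             # cases observed over the period
person_time <- sum(time_at_risk)             # 19 person-years
incidence_density <- new_cases / person_time # ~0.105 cases per person-year
```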

attack rate

Proportion of the exposed individuals that becomes affected during a time period t.

\[\frac{cases}{exposed}\]

Relationship between incidence and prevalence. Gordis

MORTALITY

mortality rate

Speed of death in time t.

Measures risk: a good proxy for incidence when the disease is highly fatal and the case-fatality is high; a bad measure when the disease is mild and few cases die.

\[\frac{deaths\:over\:period\:t}{total\:population\:at\:risk\:during\:time\:period\:t}\]

  • crude mortality rate

overall deaths

\[\frac{deaths\:over\:period\:t}{total\:population\:at\:risk\:during\:time\:period\:t}\]

  • specific mortality rate

deaths in a specific subgroup (age, sex, diseased with a certain disease)

\[\frac{deaths\:in\:subgroup\:over\:period\:t}{population\:at\:risk\:in\:subgroup\:during\:time\:period\:t}\]

case fatality rate

Proportion of the individuals that become affected by disease X who die during a time period t.

Measures disease severity

\[\frac{deaths}{cases}\]

proportionate mortality

Fraction of all the deaths caused by disease X

\[\frac{deaths\:from\:disease\:X}{all\:deaths}\]
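
A small numeric contrast of the two measures (hypothetical outbreak numbers):

```r
# Hypothetical outbreak numbers contrasting case fatality
# (severity among cases) with proportionate mortality (share of all deaths).
cases_X    <- 200   # people who developed disease X during period t
deaths_X   <- 30    # deaths among those cases
all_deaths <- 600   # deaths from all causes in the same population
case_fatality_rate      <- deaths_X / cases_X     # 0.15: severity of X
proportionate_mortality <- deaths_X / all_deaths  # 0.05: fraction of deaths due to X
```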

RATES

crude

overall population

adjusted

rates adjusted for a confounding factor to remove the effect of that factor

  • direct

Apply the specific subgroup rates of each population to a standard population and calculate the rate on the standard population.

direct adjusted example from Gordis

  • indirect

Compare populations: subgroup vs general

SMR = Standardized Mortality Ratio \[SMR= \frac{Observed}{Expected}\]

Expected: apply the general-population rates to each specific subgroup and sum all the expected cases. Observed: sum all the observed cases in each specific subgroup.

indirect adjusted example from Gordis
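
The indirect adjustment above can be sketched with hypothetical age strata:

```r
# Hypothetical indirect adjustment: apply general-population age-specific
# death rates to the study population's person-time to get expected deaths.
standard_rates <- c(young = 0.001, middle = 0.005, old = 0.020) # deaths per person-year
person_years   <- c(young = 4000,  middle = 3000,  old = 1000)  # study population
observed_deaths <- 52
expected_deaths <- sum(standard_rates * person_years)  # 4 + 15 + 20 = 39
SMR <- observed_deaths / expected_deaths               # 52/39 ~ 1.33: 33% excess mortality
```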

CAUSAL INFERENCE

INTRODUCTION TO CAUSALITY

“The questions that motivate most studies in the health, social and behavioral sciences are not associational but causal in nature. For example, what is the efficacy of a given drug in a given population? What was the cause of death of a given individual, in a specific incident? These are causal questions because they require some knowledge of the data-generating process; they cannot be computed from the data alone, nor from the distributions that govern the data” (@Pearl2009).

"The aim of standard statistical analysis is to assess parameters of a distribution from samples drawn of that distribution. With the help of such parameters, associations among variables can be inferred, which permits the researcher to estimate probabilities of past and future events and update those probabilities in light of new information. These tasks are managed well by standard statistical analysis so long as experimental conditions remain the same. Causal analysis goes one step further; its aim is to infer probabilities under conditions that are changing, for example, changes induced by treatments or external interventions.

This distinction implies that causal and associational concepts do not mix; there is nothing in a distribution function to tell us how that distribution would differ if external conditions were to change—say from observational to experimental setup—because the laws of probability theory do not dictate how one property of a distribution ought to change when another property is modified. This information must be provided by causal assumptions which identify relationships that remain invariant when external conditions change.

Causal relations cannot be expressed in the language of probability and, hence, that any mathematical approach to causal analysis must acquire new notation – probability calculus is insufficient. To illustrate, the syntax of probability calculus does not permit us to express the simple fact that “symptoms do not cause diseases,” let alone draw mathematical conclusions from such facts. All we can say is that two events are dependent—meaning that if we find one, we can expect to encounter the other, but we cannot distinguish statistical dependence, quantified by the conditional probability P(disease|symptom) from causal dependence, for which we have no expression in standard probability calculus." (Pearl 2010)

Summary: The difference between association and causality is that causality is directional, which cannot be represented with standard calculus notation.

CAUSATION

types of associations

A statistical association between an exposure and an outcome can be due to either or both a:

  • causal effect: the exposure causes the outcome. This is the effect we want to isolate using causal inference.

  • spurious effect: the exposed and the unexposed groups in the study are not comparable, or exchangeable, which is the ultimate source of the bias (the unexposed group is not the counterfactual of the exposed group) (@Hernan2002).

  • a common effect = collider

types of causal relationships

necessary

disease does not develop without this factor

sufficient

disease always develops with this factor

  1. necessary AND sufficient: without that factor, the disease never develops, and in the presence of that factor, the disease always develops. This never occurs in nature, even pathogens require other factors.

  2. necessary AND NOT sufficient: each factor is needed but alone is not able to cause disease. Ex: pathogen + immune susceptibility

  3. NOT necessary BUT sufficient: each factor alone is able to cause disease, but so can other factors. Ex: leukemia can be caused by radiation or benzene exposure

  4. NEITHER necessary NOR sufficient: presence of the factor by itself does not cause disease

Gordis figure 14-12 to 14-15

guidelines for establishing a causal relationship

Koch-Henle Criteria

  1. The organism is always found with the disease

  2. The organism is not found with any other disease

  3. The organism isolated from one individual with disease produces the disease in other individuals

Bradford Hill Criteria

Most important

  1. temporal relationship: exposure to the factor occurs before disease

  2. biologic plausibility: the association makes sense in the context of existing knowledge.

  3. consistency: the same result is replicated in different studies and populations

  4. alternative explanations: confounding. exploration of the effect of other factors on the association.

Others

  1. strength: as measured by measures of effect (risk ratio or odds ratio)

  2. dose - response: the higher the exposure, the higher the risk of disease

  3. specificity: hard to ascertain as most outcomes are multifactorial

  4. cessation effect: if the exposure ceases, so does the effect

MEASURES OF DISEASE EFFECT OR ASSOCIATION

Measures of effect compare an exposed population to its counterfactual unexposed population, that is, the exact same population at the same time point had it not been exposed. That is, the effect of \(E^+\) on the probability of being \(D^+\) in the SAME population.

Measures of association compare one exposed population to another unexposed population (a different population or the same population at a different time point) assuming that both populations are comparable. That is the effect of \(E^+\) on the probability of being \(D^+\) between \(E^-\) and \(E^+\).

causal types

Doomed = always has disease, exposed or not

Susceptible = has disease when exposed

Protected= does not have disease when exposed, but has disease when unexposed

Immune= never has disease, exposed or not

\(P(D^+|E^+) = p_1 + p_2 = doomed + susceptible\)

\(P(D^-|E^+) = p_3 + p_4 = protected + immune\)

\(P(D^+|E^-) = p_1 + p_3 = doomed + protected\)

\(P(D^-|E^-) = p_2 + p_4 = susceptible + immune\)

Disease + Disease - Total
Exposed + a = E(+) & D(+) b = E(+) & D(-) a+b = E(+)
Exposed - c = E(-) & D(+) d = E(-) & D(-) c+d = E(-)
Total a+c = D(+) b+d = D(-) a+b+c+d = total

measures of disease effect or association

Conditional probability refresher = \(P(A|B)=\frac{P(A \cap B)}{P(B)}\)

absolute risk = incidence

Measures magnitude of risk. Does not take into account the unexposed population or whether risk is associated with the exposure. \[P(D^+|E^+)=\frac{a}{a+b}=\frac{new\:cases\:observed\:over\:period\:t}{total\:population\:at\:risk\:during\:period\:t}\]

relative risk = risk ratio

Measures the strength of the association and possible causal relationship

RR = 1 \(\to\) no effect

\[\frac{P(D^+|E^+)}{P(D^+|E^-)}=\frac{\frac{a}{a+b}}{\frac{c}{c+d}}=\frac{incidence\:in\:exposed\:population}{incidence\:in\:unexposed\:population}\] It can be expressed as:

  • Risk Ratio = RR = the ratio of cumulative incidence in the exposed and unexposed populations

\[\frac{P(D^+|E^+)}{P(D^+|E^-)}=\frac{\frac{a}{a+b}}{\frac{c}{c+d}}=\frac{D^+\:in\:exposed\:population}{D^+\:in\:unexposed\:population}\]

  • Incidence Rate Ratio = IRR or IDR = the ratio of incidence densities in the exposed and unexposed populations

\[\frac{D^+\:in\:person-time\:of\:exposed\:population}{D^+\:in\:person-time\:of\:unexposed\:population}=\frac{incidence\:in\:exposed\:population}{incidence\:in\:unexposed\:population}\]
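
With hypothetical counts in the 2x2 layout above, the risk ratio works out as:

```r
# Hypothetical cohort 2x2 table, using the a, b, c, d layout above
# (cc rather than c, to avoid clashing with R's c() function).
a  <- 30; b <- 70    # exposed: diseased, not diseased
cc <- 10; d <- 90    # unexposed: diseased, not diseased
risk_exposed   <- a / (a + b)        # 0.30
risk_unexposed <- cc / (cc + d)      # 0.10
RR <- risk_exposed / risk_unexposed  # 3: exposed have 3x the risk
```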

odds ratio

Measures the strength of the association but cannot suggest a causal relationship

\[\frac{\frac{P(D^+|E^+)}{P(D^-|E^+)}}{\frac{P(D^+|E^-)}{P(D^-|E^-)}}=\frac{\frac{a}{b}}{\frac{c}{d}}=\frac{ad}{bc}=\frac{odds\:D^+\:in\:exposed\:population}{odds\:D^+\:in\:unexposed\:population}\] or

\[\frac{\frac{P(E^+|D^+)}{P(E^-|D^+)}}{\frac{P(E^+|D^-)}{P(E^-|D^-)}}=\frac{\frac{a}{c}}{\frac{b}{d}}=\frac{ad}{bc}=\frac{odds\:E^+\:in\:diseased\:population}{odds\:E^+\:in\:not\:diseased\:population}\]

A: cohort study, B: case-control study, Gordis Figure 11-5

matched-pairs OR:

Gordis Figure 11-9

ODDS RATIO CAN BE A GOOD ESTIMATE OF RELATIVE RISK When:

\[\frac{P(D^+|E^+)}{P(D^+|E^-)}\approx\frac{\frac{P(D^+|E^+)}{P(D^-|E^+)}}{\frac{P(D^+|E^-)}{P(D^-|E^-)}}\] \[\frac{\frac{a}{a+b}}{\frac{c}{c+d}} \approx \frac{\frac{a}{b}}{\frac{c}{d}}\]

  • We assume the disease is rare:
    • rare disease assumption

When the disease does not occur frequently \(a+b \approx b\) and \(c+d \approx d\)

Gordis Figure 11-6, 11-7

  • The cases are representative, with regards to the history of exposure, of all the people with disease in the population from which the cases were drawn:

  • The controls are representative, with regards to the history of exposure, of all the people without disease in the population from which the cases were drawn:

The controls can be selected through different methods:

  • Not matched on time

    • case-based sampling: sampling occurs at the beginning of the study (\(t_0\))

    • cumulative incidence sampling: sampling occurs at the end of the study (\(t_1\))

assumption of constant incidence density rate over the period of time: \(\frac{ID_{exposed(t)}}{ID_{unexposed(t)}}=\bar{IDR}_{t_0 \to t_1}\)

assumption of a stable population with respect to exposure = time is NOT a confounder

  • Matched on time

    • incidence density sampling: match on time with the cases (\(t_0 - t_1\))

assumption of constant incidence density rate over the period of time: \(\frac{ID_{exposed(t)}}{ID_{unexposed(t)}}=\bar{IDR}_{t_0 \to t_1} \to ID_{exposed(t)} = ID_{unexposed(t)} * \bar{IDR}_{t_0 \to t_1}\)

Both incidence and exposure change in function of time, therefore time is a confounder. By selecting the controls matching on time, we can interpret the odds ratio as a rate or risk measure, without making the rare disease assumption, by assuming only that the incidence is constant over time.
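
The rare disease assumption above can be checked numerically (hypothetical counts):

```r
# Hypothetical rare-disease cohort: with few cases, a+b ~ b and cc+d ~ d,
# so the odds ratio closely approximates the risk ratio.
a  <- 20; b <- 9980    # exposed
cc <- 10; d <- 9990    # unexposed
OR <- (a * d) / (b * cc)               # 2.002...
RR <- (a / (a + b)) / (cc / (cc + d))  # 2
```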

INTERPRETATION OF ODDS RATIO

\[\frac{\frac{P(D^+|E^+)}{P(D^-|E^+)}}{\frac{P(D^+|E^-)}{P(D^-|E^-)}}= \frac{\frac{a/(a+b)}{b/(a+b)}}{\frac{c/(c+d)}{d/(c+d)}}=\frac{a/b}{c/d}\]

The disease odds among the exposed, \(\frac{P(D^+|E^+)}{P(D^-|E^+)}= \frac{a/(a+b)}{b/(a+b)}=\frac{a}{b}\), is the ratio of the average risk \(\frac{a}{a+b}\) to the average survival probability \(\frac{b}{a+b}\),

which is not the same as the average of the individual disease odds.

attributable risk

Incidence of a disease in the exposed population that is attributable to the exposure.

If > 0, the risk in the presence of the exposure is greater than in its absence.

How much of the disease would be prevented if the exposure were eliminated?

\[{P(D^+|E^+)}-{P(D^+|E^-)}={\frac{a}{a+b}}-{\frac{c}{c+d}}={incidence\:in\:exposed\:population} -{incidence\:in\:unexposed\:population}\]

as a proportion:

\[\frac{{P(D^+|E^+)}-{P(D^+|E^-)}}{P(D^+|E^+)}=\frac{{\frac{a}{a+b}}-{\frac{c}{c+d}}}{\frac{a}{a+b}}=\frac{{incidence\:in\:exposed\:population} -{incidence\:in\:unexposed\:population}}{incidence\:in\:exposed\:population}\] \[\frac{{P(D^+|E^+)}/{P(D^+|E^-)}-{P(D^+|E^-)}/{P(D^+|E^-)}}{P(D^+|E^+)/{P(D^+|E^-)}}= \frac{RR - 1}{RR}= 1- 1/RR\]

Gordis Figure 12-1, Gordis Table 12-1
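
With the same hypothetical 2x2 counts, the attributable risk and its proportion come out as:

```r
# Hypothetical cohort 2x2 table: attributable risk and attributable-risk percent
# (cc rather than c, to avoid clashing with R's c() function).
a  <- 30; b <- 70; cc <- 10; d <- 90
risk_exposed   <- a / (a + b)
risk_unexposed <- cc / (cc + d)
AR      <- risk_exposed - risk_unexposed  # 0.20 excess risk among the exposed
AR_prop <- AR / risk_exposed              # 2/3 of exposed cases attributable
RR      <- risk_exposed / risk_unexposed
# AR_prop also equals (RR - 1)/RR = 1 - 1/RR, as derived above
```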

ATTRIBUTABLE FRACTIONS

  • Etiologic fraction: proportion of cases in exposed population where exposure has a biological role in the disease

  • Excess fraction: proportion of cases in exposed population where exposure has a role of incrementing the disease incidence vs the unexposed population

population attributable risk

Incidence of a disease in the total population that is attributable to the exposure.

How much of the disease in the total population would be prevented if the exposure were eliminated?

Incidence in the total population: \(P(D^+|E^+) * P(E^+) + P(D^+|E^-) * P(E^-)= \frac{a}{a+b} * \frac{a+b}{a+b+c+d} + \frac{c}{c+d} * \frac{c+d}{a+b+c+d}= \frac{a+c}{a+b+c+d}\) = incidence in exposed population * proportion exposed population + incidence in unexposed population * proportion unexposed population.

\[\frac{[P(D^+|E^+) * P(E^+) + P(D^+|E^-) * P(E^-)]-{P(D^+|E^-)}}{P(D^+|E^+) * P(E^+) + P(D^+|E^-) * P(E^-)}=\frac{[{\frac{a}{a+b} * \frac{a+b}{a+b+c+d} + \frac{c}{c+d} * \frac{c+d}{a+b+c+d}}]-{\frac{c}{c+d}}}{\frac{a}{a+b} * \frac{a+b}{a+b+c+d} + \frac{c}{c+d} * \frac{c+d}{a+b+c+d}}=\frac{{incidence\:in\:total\:population} -{incidence\:in\:unexposed\:population}}{incidence\:in\:total\:population}\]

Gordis formula 12-4
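
Continuing with the same hypothetical counts:

```r
# Hypothetical 2x2 counts: population attributable risk proportion
# (cc rather than c, to avoid clashing with R's c() function).
a  <- 30; b <- 70; cc <- 10; d <- 90
incidence_total     <- (a + cc) / (a + b + cc + d)  # 40/200 = 0.20
incidence_unexposed <- cc / (cc + d)                # 0.10
PAR_prop <- (incidence_total - incidence_unexposed) / incidence_total  # 0.5
```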

ERROR IN INFERENCE

random error

Chance or random variation that remains unexplained.

The association lacks precision. The results are less reproducible.

  • unexplained variation : non-deterministic counterfactuals

  • sampling error : the degree to which a sample population deviates from the total population. It’s unpredictable and due to the sampling process.

A sample is a subset of the subjects in the population that could have been included in the study, or a subset of the experiences the study subjects could have had.

ASSUMPTIONS OF SAMPLING

  • Randomness assumption: the sample is a random selection of the subjects in the population that could have been included in the study

  • Representativeness assumption: the sample is representative of the subjects in the population that could have been included in the study

systematic error = bias

The association lacks validity. The results are biased.

insert irva/Causal Inference: What If by Miguel A. Hernán, James M. Robins here

MEASURES OF ACCURACY

precision/reliability

the amount of random error. High precision indicates the results are always similar in different experiments.

validity

the amount of systematic error. High validity indicates proximity to the true value

  • external validity: generalizability of the results to the general population

  • internal validity: comparability among the groups in the study

Gordis figure 15-9

BIAS

  • Selection
  • Information/misclassification (differential/non-differential)
  • Confounding
    • Methods for identifying/detecting confounding
    • Methods for controlling confounding

The exposed and the unexposed in the study are not comparable, or exchangeable, which is the ultimate source of the bias ( the unexposed group is not the counterfactual of the exposed group)

  • There is confounding when the association between exposure and outcome includes a noncausal component attributable to their having an uncontrolled common cause.

  • There is selection bias when the association between exposure and outcome includes a noncausal component attributable to restricting the analysis to certain level(s) of a common effect of exposure and outcome or, more generally, to conditioning on a common effect of variables correlated with exposure and outcome.

confounding

Confounder

a variable that is:

  • associated with the outcome, conditional on the exposure (i.e. among the unexposed)
  • associated with the exposure in the source population
  • not on the causal pathway between the exposure and the outcome.

INTERACTION (EFFECT MEASURE MODIFICATION)

  • Additive
  • Multiplicative
  • Absolute vs. Relative Measures of Effect

description of the example

We will look into causal inference using a working example from Statistical Rethinking: A Bayesian Course with Examples in R and Stan, second edition, by Richard McElreath: the correlation between marriage rate (the exposure) and divorce rate (the outcome).

There are three observed variables in play: divorce rate (D), marriage rate (M), and the median age at marriage (A) in each State of the U.S. Both marriage rates and median age at marriage are great predictors of the divorce rate in a given State, but are these relationships causal?

The rate at which adults marry is a great predictor of divorce rate. But does marriage cause divorce? In a trivial sense it obviously does: One cannot get a divorce without first getting married. But there’s no reason high marriage rate must cause more divorce. It’s easy to imagine high marriage rate indicating high cultural valuation of marriage and therefore being associated with low divorce rate.

Age at marriage is also a good predictor of divorce rate— higher age at marriage predicts less divorce. But there is no reason this has to be causal, either, unless age at marriage is very late and the spouses do not live long enough to get a divorce.

# load data and copy 
library(rethinking) 
data(WaffleDivorce) 
d <- WaffleDivorce
# standardize variables
d$D <- standardize( d$Divorce )
d$M <- standardize( d$Marriage )
d$A <- standardize( d$MedianAgeMarriage )

\(D_{i} ∼ Normal(\mu_{i}, \sigma)\)

\(\mu_{i} = \alpha + \beta_{A}A_{i}\)

Since the outcome and the predictor are both standardized, the intercept \(\alpha\) should end up very close to zero.

What does the prior slope \(\beta_{A}\) imply? If \(\beta_{A}\) = 1, that would imply that a change of one standard deviation in age at marriage is associated likewise with a change of one standard deviation in divorce. To know whether or not that is a strong relationship, you need to know how big a standard deviation of age at marriage is:

 sd( d$MedianAgeMarriage )
## [1] 1.24363

So when \(\beta_{A}\) = 1, a change of 1.2 years in median age at marriage is associated with a full standard deviation change in the outcome variable. That seems like an insanely strong relationship.

m5.1 <- quap(
    alist(
        D ~ dnorm( mu , sigma ) ,
        mu <- a + bA * A ,
        a ~ dnorm( 0 , 0.2 ) ,
        #when βA = 1, a change of 1.2 years in median age at marriage is associated with a full standard deviation change in the outcome variable (divorce)
        #only 5% of plausible slopes more extreme than 1.
        bA ~ dnorm( 0 , 0.5 ) ,
        sigma ~ dexp( 1 )
) , data = d )

precis(m5.1)
##                mean         sd       5.5%      94.5%
## a      2.210467e-08 0.09737880 -0.1556301  0.1556301
## bA    -5.684035e-01 0.10999985 -0.7442045 -0.3926025
## sigma  7.883260e-01 0.07801142  0.6636487  0.9130033

posterior for \(\beta_{A}\) is reliably negative, as seen:

# compute percentile interval of mean
A_seq <- seq( from=-3 , to=3.2 , length.out=30 )
mu <- link( m5.1 , data=list(A=A_seq) )
mu.mean <- apply( mu , 2, mean )
mu.PI <- apply( mu , 2 , PI )
# plot it all
plot( D ~ A , data=d , col=rangi2 )
lines( A_seq , mu.mean , lwd=2 )
shade( mu.PI , A_seq )

\(D_{i} ∼ Normal(\mu_{i}, \sigma)\)

\(\mu_{i} = \alpha + \beta_{M}M_{i}\)

m5.2 <- quap(
    alist(
        D ~ dnorm( mu , sigma ) ,
        mu <- a + bM * M ,
        a ~ dnorm( 0 , 0.2 ) ,
        bM ~ dnorm( 0 , 0.5 ) ,
        sigma ~ dexp( 1 )
) , data = d )

precis(m5.2)
##               mean         sd       5.5%     94.5%
## a     8.512747e-08 0.10824652 -0.1729988 0.1729989
## bM    3.500538e-01 0.12592759  0.1487972 0.5513104
## sigma 9.102665e-01 0.08986268  0.7666486 1.0538844
# compute percentile interval of mean
M_seq <- seq( from=-3 , to=3.2 , length.out=30 )
mu <- link( m5.2 , data=list(M=M_seq) )
mu.mean <- apply( mu , 2, mean )
mu.PI <- apply( mu , 2 , PI )
# plot it all
plot( D ~ M , data=d , col=rangi2 )
lines( M_seq , mu.mean , lwd=2 )
shade( mu.PI , M_seq )

This relationship isn’t as strong as the previous one.

The pattern we see in the previous two models is symptomatic of a situation in which only one of the predictor variables, A in this case, has a causal impact on the outcome, D, even though both predictor variables are strongly associated with the outcome.
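
In the book this pattern is confirmed by fitting a model with both predictors and watching the coefficient on M shrink toward zero. The same logic can be sketched on simulated data (a hypothetical generative model in which A drives both M and D, and M has no effect on D):

```r
# Simulate: age at marriage A influences both marriage rate M and divorce
# rate D; M itself has NO causal effect on D.
set.seed(11)
N <- 500
A <- rnorm(N)        # median age at marriage (standardized)
M <- rnorm(N, -A)    # younger marriage age -> higher marriage rate
D <- rnorm(N, -A)    # younger marriage age -> higher divorce rate
coef(lm(D ~ M))["M"]      # clearly positive: M predicts D on its own
coef(lm(D ~ A + M))["M"]  # near zero once A is in the model
```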

causal effect

The total causal effect is the sum of the direct and indirect effects

Example: age of marriage influences divorce in two ways.

direct effect

Example: a direct effect would arise because younger people change faster than older people and are therefore more likely to grow incompatible with a partner.

indirect effect

Example: age of marriage has an indirect effect by influencing the marriage rate, which then influences divorce. If people get married earlier, then the marriage rate may rise, because there are more young people. Consider for example if an evil dictator forced everyone to marry at age 65. Since a smaller fraction of the population lives to 65 than to 25, forcing delayed marriage will also reduce the marriage rate. If marriage rate itself has any direct effect on divorce, maybe by making marriage more or less normative, then some of that direct effect could be the indirect effect of age at marriage.

spurious effect

The exposed and the unexposed in the study are not comparable, or exchangeable, which is the ultimate source of the bias ( the unexposed group is not the counterfactual of the exposed group)

  • There is confounding when the association between exposure and outcome includes a noncausal component attributable to their having an uncontrolled common cause.

  • There is selection bias when the association between exposure and outcome includes a noncausal component attributable to restricting the analysis to certain level(s) of a common effect of exposure and outcome or, more generally, to conditioning on a common effect of variables correlated with exposure and outcome.

confounding

Confounder

a variable that is:

  • associated with the outcome, conditional on the exposure (i.e. among the unexposed)
  • associated with the exposure in the source population
  • not on the causal pathway between the exposure and the outcome.

strategies to identify confounders

  • Methods relying on statistical associations that can easily be identified from the data:
  1. automatic variable selection procedures: e.g. stepwise regression. It assumes that all important confounders will be selected

  2. change in effect estimate: comparison of the effect estimates between adjusted and unadjusted effect estimates. The variable is selected as a confounder if there is a relative change greater than 10%. It assumes that any variable substantially associated with an estimate change is worth adjusting for.

  • Methods that combine statistical associations from the data with some background knowledge about the causal network that links exposure, outcome, and potential confounders.
  1. check whether the variable meets the criteria of a confounder: combines information from statistical associations and background knowledge.

Statistical criteria alone are insufficient to characterize either confounding or selection bias.

The presence of common causes, and therefore of confounding, can be represented by causal diagrams known as directed acyclic graphs (DAGs).

DAG = directed acyclic graph

diagrams that link variables by arrows that represent direct causal effects (protective or causative) of one variable on another.

There are only four types of variable relations that combine to form all possible paths:

  1. the CONFOUNDER = fork: X ← Z → Y. This is the classic confounder: some variable Z is a common cause of X and Y, generating a correlation between them. If we condition on Z, then learning X tells us nothing about Y. X and Y are independent, conditional on Z.

  2. the PIPE = intermediary: X → Z → Y. The treatment X influences Z which influences Y. If we condition on Z, we block the path from X to Y. X and Y are independent, conditional on Z.

  3. the COLLIDER = common effect: X → Z ← Y. Conditioning on Z, the collider variable, opens the path. X and Y are dependent, conditional on Z, however neither X nor Y has any causal influence on the other.

  4. the DESCENDANT: Z \(\to\) D. A descendant is a variable influenced by another variable. Conditioning on a descendant partly conditions on its parent: conditioning on D will also condition, to a lesser extent, on Z, because D carries some information about Z.
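
Relations 1 and 3 can be checked by simulation (hypothetical standardized variables; "conditioning" is done here by regressing out Z and correlating the residuals):

```r
# FORK: X <- Z -> Y. Association exists, and closes when we condition on Z.
set.seed(42)
N <- 1e4
Z <- rnorm(N); X <- rnorm(N, Z); Y <- rnorm(N, Z)
cor(X, Y)                                    # ~0.5: spurious association
cor(resid(lm(X ~ Z)), resid(lm(Y ~ Z)))      # ~0: path closed given Z

# COLLIDER: X -> Z <- Y. No association, until we condition on Z.
X2 <- rnorm(N); Y2 <- rnorm(N); Z2 <- rnorm(N, X2 + Y2)
cor(X2, Y2)                                  # ~0: independent
cor(resid(lm(X2 ~ Z2)), resid(lm(Y2 ~ Z2))) # ~ -0.5: path opened given Z
```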

Backdoor criterion

Path : any series of variables you could walk through to get from one variable to another, ignoring the directions of the arrows.

Blocking all confounding paths between some predictor X and some outcome Y is known as shutting the backdoor, thus eliminating spurious associations that are non-causal.

Example:

There are two paths connecting E and O: (1) E → O (2) E ← C → O.

Both of these paths create a statistical association between E and O. But only the first path is causal. The second path is non-causal. If only the second path existed, and we changed E, it would not change O. Any causal influence of E on O operates only on the first path.
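
The same two paths can be enumerated with the dagitty package loaded above (the three-variable DAG here is hypothetical, written to match the example):

```r
library(dagitty)
# E -> O is the causal path; E <- C -> O is the backdoor through confounder C.
g <- dagitty("dag { E <- C -> O ; E -> O }")
paths(g, "E", "O")$paths   # lists both paths
adjustmentSets(g, "E", "O")  # conditioning on C closes the backdoor
```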

strategies to control confounders

Manipulation of the confounding factor in the study design:

Manipulation removes the influence of C on E: when we determine E, the C variable does not influence E, thus blocking the non-causal path between E and O (E ← C → O). Once the path is blocked, there is only one way for information to go between E and O, and then measuring the association between E and O would yield a useful measure of causal influence.

  • study design:
    • randomization
    • restriction
  • study design + study analysis:
    • matching

Conditioning on the confounding factor in the study analysis:

Adding C to the model blocks the non-causal path E ← C → O.

Why? Think of this path in isolation, as a complete model.

Once you learn C, also learning E will give you no additional information about O.

Example: Suppose for example that C is the average wealth in a region. Regions with high wealth have better schools, resulting in more education (exposure E), as well as better paying jobs, resulting in higher wages (outcome O). If you don’t know the region a person lives in, learning the person’s education E will provide information about their wages O, because E and O are correlated across regions. But after you learn which region a person lives in, assuming there is no other path between E and O, then learning E tells you nothing more about O. This is the sense in which conditioning on C blocks the path—it makes E and O independent, conditional on C.
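
The wealth example can be simulated directly (hypothetical standardized units), using multivariate adjustment to condition on C:

```r
# Regional wealth C drives both education E and wages O; no E -> O arrow.
set.seed(7)
N <- 1e4
C <- rnorm(N)        # average regional wealth
E <- rnorm(N, C)     # education, driven by wealth
O <- rnorm(N, C)     # wages, driven by wealth
cor(E, O)                  # ~0.5: spurious E-O association via C
coef(lm(O ~ E + C))["E"]   # ~0: adding C to the model blocks E <- C -> O
```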

  • study analysis
    • stratification
    • standardization
    • inverse probability of treatment weighting
    • multivariate adjustment

Obtaining an unbiased estimate of the causal effect:

  1. List all of the paths connecting E (the potential cause of interest) and O (the outcome).
  2. Classify each path by whether it is open or closed. A path is open unless it contains a collider.
  3. Classify each path by whether it is a backdoor path. A backdoor path has an arrow entering E.
  4. If there are any open backdoor paths, decide which variable(s) to condition on to close it (if possible).
  • Obtaining an unbiased estimate of the total causal effect requires measuring and adjusting for all confounders of the E \(\to\) O association

  • Obtaining an unbiased estimate of the direct causal effect requires measuring and adjusting for all confounders of both the

    • J \(\to\) O association
    • E \(\to\) O association
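Steps 1–4 can be automated: dagitty’s adjustmentSets() searches a DAG for sets of variables that close every open backdoor path. A minimal hypothetical example with a single confounder:

```r
library(dagitty)

# Minimal DAG: one causal path E -> O and one backdoor path E <- C -> O
g <- dagitty('dag{ C -> E ; C -> O ; E -> O }')
adjustmentSets(g, exposure = "E", outcome = "O")
## { C }
```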

Do-operator

do(E) closes the backdoor paths into E, as in a manipulative experiment.

P(O|do(E)) defines a causal relationship because it tells us the expected result on O of manipulating E

*Confounding: P(O|E) \(\ne\) P(O|do(E)). The association between E and O is not the same once the backdoor paths are closed, indicating that there is confounding.

*Conditional probability, non-causal: comparing P(O|E) with P(O|not-E) does not close the backdoor paths, and therefore does not give a causal relationship.

*Total causal relationship: if P(O|do(E)) \(\ne\) P(O|do(not-E)), then E is a cause of O.

*Direct causal relationship: might require closing more backdoor paths.
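A hand-computable illustration with hypothetical binary probabilities, in which E has no causal effect on O at all: the observational P(O|E) differs from the interventional P(O|do(E)) obtained by the backdoor adjustment formula \(P(O|do(E)) = \sum_{c} P(O|E,c)P(c)\).

```r
# Hypothetical binary world: C -> E and C -> O, but NO E -> O arrow
p_c1 <- 0.5                                      # P(C = 1)
p_e1 <- c("0" = 0.2, "1" = 0.8)                  # P(E = 1 | C)
p_o1 <- c("0" = 0.1, "1" = 0.9)                  # P(O = 1 | C); E plays no role

# Observational P(O = 1 | E = 1): reweight the strata of C by Bayes' rule
w1 <- p_e1["1"] * p_c1                           # P(E = 1, C = 1) = 0.40
w0 <- p_e1["0"] * (1 - p_c1)                     # P(E = 1, C = 0) = 0.10
p_obs <- unname((p_o1["1"] * w1 + p_o1["0"] * w0) / (w1 + w0))   # 0.74

# Interventional P(O = 1 | do(E = 1)): backdoor adjustment, C keeps P(C)
p_do <- unname(p_o1["1"] * p_c1 + p_o1["0"] * (1 - p_c1))        # 0.50

c(observed = p_obs, intervened = p_do)   # 0.74 vs 0.50: confounding
```

Repeating the calculation for do(E = 0) also gives 0.50, so the intervention changes nothing, exactly as the DAG says it should.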

  • To obtain the total causal effect we condition on C but not J:

    • C is a confounder of the J \(\to\) O association so we must condition on it to obtain the unbiased total causal effect
dag1.4 %>% 
ggdag_dseparated(from = "E", to = "O", controlling_for = "C")+
  theme_dag()

  • J is a collider on the path between E and C, so if we condition on J we open a non-causal path between E and O through C.
dag1.4 %>% 
ggdag_dseparated(from = "E", to = "O", controlling_for = "J")+
  theme_dag()

  • To obtain the direct causal effect we condition on both J and C.
dag1.4 %>% 
ggdag_dseparated(from = "E", to = "O", controlling_for = c("J", "C"))+
  theme_dag()

  • To obtain the total causal effect we condition on D
dag1.5 %>% 
ggdag_dseparated(from = "E", to = "O", controlling_for = "D")+
  theme_dag()

  • To obtain the total causal effect we condition on C, D, J
dag1.5 %>% 
ggdag_dseparated(from = "E", to = "O", controlling_for = c("C", "D", "J"))+
  theme_dag()

Divorce rate example:

To infer the strength of these different arrows, we need more than one statistical model.

  • To obtain the total effect of age at marriage on divorce rate we regress divorce rate on age at marriage alone, without conditioning on marriage rate.

The total causal effect is the sum of the direct and indirect effects

Model m5.1, the regression of D on A, tells us only that the total influence of age at marriage on divorce rate is strongly negative. The “total” here means we have to account for every path from A to D. There are two such paths in this graph: A → D, a direct path, and A → M → D, an indirect path.
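In a linear DAG the two paths simply add: total effect = direct effect + (A → M slope) × (M → D slope). A quick base-R simulation with hypothetical coefficients:

```r
set.seed(3)
n <- 1e4
A <- rnorm(n)
M <- -0.6 * A + rnorm(n)               # A -> M
D <- -0.4 * A + 0.2 * M + rnorm(n)     # A -> D (direct) and M -> D

coef(lm(D ~ A))["A"]       # total effect: near -0.4 + (-0.6)(0.2) = -0.52
coef(lm(D ~ A + M))["A"]   # direct effect alone: near -0.4
```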

\(D_{i} ∼ Normal(\mu_{i}, \sigma)\)

\(\mu_{i} = \alpha + \beta_{A}A_{i}\)

m5.1 <- quap(
    alist(
        D ~ dnorm( mu , sigma ) ,
        mu <- a + bA * A ,
        a ~ dnorm( 0 , 0.2 ) ,
        #when βA = 1, a change of 1.2 years in median age at marriage is associated with a full standard deviation change in the outcome variable (divorce)
        #only 5% of plausible slopes more extreme than 1.
        bA ~ dnorm( 0 , 0.5 ) ,
        sigma ~ dexp( 1 )
) , data = d )

precis(m5.1)
##                mean         sd       5.5%      94.5%
## a      5.471950e-07 0.09737869 -0.1556294  0.1556305
## bA    -5.684038e-01 0.10999969 -0.7442045 -0.3926030
## sigma  7.883249e-01 0.07801114  0.6636480  0.9130018
dag1.3 %>% 
ggdag_dseparated(from = "A", to = "D") +
  theme_dag()

In general, it is possible that a variable like A has no direct effect at all on an outcome like D. It could still be associated with D entirely through the indirect path. That type of relationship is known as mediation.

As you’ll see however, the indirect path does almost no work in this case. How can we show that?

We know from m5.2 that marriage rate is positively associated with divorce rate. But that isn’t enough to tell us that the path M → D is positive. It could be that the association between M and D arises entirely from A’s influence on both M and D. Like this:

This DAG is also consistent with the posterior distributions of models m5.1 and m5.2. Why? Because both M and D “listen” to A. They have information from A. So when you inspect the association between D and M, you pick up that common information that they both got from listening to A.

So which is it? Is there a direct effect of marriage rate, or rather is age at marriage just driving both, creating a spurious correlation between marriage rate and divorce rate? To find out, we need to consider carefully what each DAG implies.

TESTABLE IMPLICATIONS

Testable implications can be read off the diagrams using a graphical criterion known as d-separation (Pearl, 1988). Each diagram encodes causal assumptions, each corresponding to a missing arrow or a missing double-arrow between a pair of variables.

DAGs imply that some variables are independent of others under certain conditions; therefore the testable implications of a DAG are its CONDITIONAL INDEPENDENCIES.

CONDITIONAL INDEPENDENCIES describe which variables should be associated with one another (or not) in the data, and which variables become disassociated when we condition on some other set of variables.

Conditional independencies are pairs of variables that are not associated, once we condition on some set of other variables.

Conditioning: conditioning on a variable Z means learning its value and then asking if X adds any additional information about Y. If learning X doesn’t give you any more information about Y, then we might say that Y is independent of X conditional on Z. This conditioning statement is sometimes written as: \(Y \!\perp\!\!\!\perp X|Z\)

\(X \not\!\perp\!\!\!\perp Y\) means “not independent of”

\(X \!\perp\!\!\!\perp Y\) means “independent of”

In our divorce example

If we look in the data and find that any pair of variables are not associated, then something is wrong with the DAG (assuming the data are correct). In these data, all three pairs are in fact strongly associated. Check for yourself. You can use cor to measure simple correlations. Correlations are sometimes terrible measures of association—many different patterns of association with different implications can produce the same correlation. But they do honest work in this case.

cor(d$D, d$M)
## [1] 0.3737314
cor(d$D, d$A)
## [1] -0.5972392
cor(d$M, d$A)
## [1] -0.721096
  • First DAG: M has influence on D.
ggdag(dag1.3) +
  theme_dag()

This DAG says:

  1. A directly influences D
  2. M directly influences D
  3. A directly influences M

There are 3 causal assumptions that can be tested (one for every arrow).

Before we condition on anything, we assume everything is associated with everything else.

The testable implications are:

  1. \(D \not\!\perp\!\!\!\perp A\) D not independent of A
  2. \(D \not\!\perp\!\!\!\perp M\) D not independent of M
  3. \(A \not\!\perp\!\!\!\perp M\) A not independent of M

implied conditional independencies = none

DMA_dag1 <- dagitty('dag{ D <- A -> M -> D }')
impliedConditionalIndependencies( DMA_dag1  )
  • Second DAG: M has no influence on D.
ggdag(dag1.2) +
  theme_dag()

In this DAG, it is still true that all three variables are associated with one another. A is associated with D and M because it influences them both. And D and M are associated with one another because A influences them both: they share a cause, and this leads them to be correlated with one another through that cause. There are 2 causal assumptions that can be tested (one for every arrow). Before we condition on anything, we assume everything is associated with everything else.

  1. A causes D
  2. A causes M

But suppose we condition on A. All of the information in M that is relevant to predicting D is in A. So once we’ve conditioned on A, M tells us nothing more about D. So in the second DAG, a testable implication is that D is independent of M, conditional on A. In other words, \(D \!\perp\!\!\!\perp M|A\)

The testable implications are:

All 3 variables should be associated, before conditioning on anything:

  1. \(D \not\!\perp\!\!\!\perp A\) D not independent of A

  2. \(D \not\!\perp\!\!\!\perp M\) D not independent of M

  3. \(A \not\!\perp\!\!\!\perp M\) A not independent of M

  4. \(D \!\perp\!\!\!\perp M|A\) D and M should be independent after conditioning on A.

  • implied conditional independencies = D || M | A
DMA_dag2 <- dagitty('dag{ D <- A -> M }') 
impliedConditionalIndependencies( DMA_dag2  )
## D _||_ M | A
  • Test the difference between the two DAGs

The only implication that differs between these DAGs is the last one: \(D \!\perp\!\!\!\perp M|A\), i.e. D and M should be independent after conditioning on A.

To test this implication, we need a statistical model that conditions on A, so we can see whether that renders D independent of M. And that is what multiple regression helps with. It can address a useful descriptive question: Is there any additional value in knowing a variable, once I already know all of the other predictor variables?

So for example once you fit a multiple regression to predict divorce using both marriage rate and age at marriage, the model addresses the questions: (1) After I already know marriage rate, what additional value is there in also knowing age at marriage? (2) After I already know age at marriage, what additional value is there in also knowing marriage rate?

The parameter estimates corresponding to each predictor are the (often opaque) answers to these questions. The questions above are descriptive, and the answers are also descriptive. It is only the derivation of the testable implications above that give these descriptive results a causal meaning. But that meaning is still dependent upon believing the DAG.

For each predictor, the parameter measures its conditional association with the outcome.

\(D_{i} ∼ Normal(\mu_{i}, \sigma)\)

\(\mu_{i} = \alpha + \beta_{M}M_{i} + \beta_{A}A_{i}\)

m5.3 <- quap(
    alist(
        D ~ dnorm( mu , sigma ) ,
        mu <- a + bM*M + bA*A ,
        a ~ dnorm( 0 , 0.2 ) ,
        bM ~ dnorm( 0 , 0.5 ) ,
        bA ~ dnorm( 0 , 0.5 ) ,
        sigma ~ dexp( 1 )
    ) , data = d )
precis( m5.3 )
##                mean         sd       5.5%      94.5%
## a      2.950299e-05 0.09708055 -0.1551240  0.1551830
## bM    -6.523122e-02 0.15078252 -0.3062108  0.1757484
## bA    -6.134182e-01 0.15099278 -0.8547339 -0.3721026
## sigma  7.851658e-01 0.07785526  0.6607381  0.9095935
dag1.3 %>% 
ggdag_dseparated(from = "A", to = "D", controlling_for = "M") +
  theme_dag()

The posterior mean for marriage rate, bM, is now close to zero, with plenty of probability on both sides of zero. The posterior mean for age at marriage, bA, is essentially unchanged. It will help to visualize the posterior distributions for all three models, focusing just on the slope parameters βA and βM:

plot(coeftab(m5.1,m5.2,m5.3), par=c("bA","bM"))

bA doesn’t move, only grows a bit more uncertain, while bM is only associated with divorce when age at marriage is missing from the model. You can interpret these distributions as saying: Once we know median age at marriage for a State, there is little or no additional predictive power in also knowing the rate of marriage in that State, which means \(D \!\perp\!\!\!\perp M|A\). D and M are independent after conditioning on A, which corresponds to the second DAG.

Note that this does not mean that there is no value in knowing marriage rate. Consistent with the earlier DAG, if you didn’t have access to age-at-marriage data, then you’d definitely find value in knowing the marriage rate. M is predictive but not causal. Assuming there are no other causal variables missing from the model, this implies there is no important direct causal path from marriage rate to divorce rate. The association between marriage rate and divorce rate is spurious, caused by the influence of age of marriage on both marriage rate and divorce rate.

We’re interested in the total causal effect of the number of Waffle Houses on divorce rate in each State. Presumably, the naive correlation between these two variables is spurious. What is the minimal adjustment set that will block backdoor paths from Waffle House to divorce?

Let’s make a graph:
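The code chunk that built this graph did not survive. Here is a dagitty definition consistent with the Southern-State Waffle House example (the exact arrow set is an assumption, chosen to reproduce the adjustment sets printed below), reusing the dag6 name the later chunk expects:

```r
library(dagitty)

# Assumed DAG: S = southern State, W = Waffle Houses, A = median age at
# marriage, M = marriage rate, D = divorce rate
dag6 <- dagitty('dag{
  S -> A -> D
  S -> M -> D
  S -> W -> D
  A -> M
}')
adjustmentSets(dag6, exposure = "W", outcome = "D")
```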

## { A, M }
## { S }

We could control for either A and M or for S alone. This DAG is obviously not satisfactory—it assumes there are no unobserved confounds, which is very unlikely for this sort of data. But we can still learn something by analyzing it. While the data cannot tell us whether a graph is correct, it can sometimes suggest how a graph is wrong.

Inspecting implied conditional independencies, we can at least test some of the features of a graph.

 impliedConditionalIndependencies( dag6 )
## A _||_ W | S
## D _||_ S | A, M, W
## M _||_ W | S
  1. The median age of marriage should be independent of (||) Waffle Houses, conditioning on (|) a State being in the south.

  2. Divorce and being in the south should be independent when we simultaneously condition on all of median age of marriage, marriage rate, and Waffle Houses.

  3. Marriage rate and Waffle Houses should be independent, conditioning on being in the south.

include: colliders, multicollinearity, post-treatment bias. end of chapter 6

INTERACTIONS chapter 8

EPI208

Testing vs estimation:

Hypothesis testing: the P-value does not give information about:

  • the direction of the association
  • the magnitude of the effect (it mixes precision with magnitude)

In addition, it depends on sample size, and its clinical relevance is not clear.

Estimation: effect size (OR, RR) and precision (95% CI) are separated.

Predictive vs estimation models

Prediction: the goal is to determine the combination of factors that provides the best prediction of an outcome. Variable selection is determined by strength of association.

Estimation of association: the goal is to determine the unbiased association of one factor with another. Variable selection is determined by change in the estimate of association.

CONFOUNDING \(\ne\) effect modification \(\ne\) interaction